Kaggle Description:
The actual forest cover type for a given 30 x 30 meter cell was determined from US Forest Service (USFS) Region 2 Resource Information System data. Independent variables were then derived from data obtained from the US Geological Survey and USFS. The data is in raw form (not scaled) and contains binary columns of data for qualitative independent variables such as wilderness areas and soil type.
This study area includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. These areas represent forests with minimal human-caused disturbances, so that existing forest cover types are more a result of ecological processes rather than forest management practices.
The variables in this data set are:
| Variable Name | Description | Description of Values |
|---|---|---|
| Elevation | Elevation in meters | |
| Aspect | Aspect in degrees azimuth | |
| Slope | Slope in degrees | |
| Horizontal_Distance_To_Hydrology | Horz Dist to nearest surface water features | |
| Vertical_Distance_To_Hydrology | Vert Dist to nearest surface water features | |
| Horizontal_Distance_To_Roadways | Horz Dist to nearest roadway | |
| Hillshade_9am | Hillshade index at 9am, summer solstice | (0 to 255 index) |
| Hillshade_Noon | Hillshade index at noon, summer solstice | (0 to 255 index) |
| Hillshade_3pm | Hillshade index at 3pm, summer solstice | (0 to 255 index) |
| Horizontal_Distance_To_Fire_Points | Horz Dist to nearest wildfire ignition points | |
| Wilderness_Area | Wilderness area designation | (4 binary columns, 0 = absence or 1 = presence) |
| Soil_Type | Soil Type designation | (40 binary columns, 0 = absence or 1 = presence) |
| Cover_Type | Forest Cover Type designation | (7 types, integers 1 to 7) |
The wilderness areas are:
| Code | Name |
|---|---|
| 1 | Rawah Wilderness Area |
| 2 | Neota Wilderness Area |
| 3 | Comanche Peak Wilderness Area |
| 4 | Cache la Poudre Wilderness Area |
The soil types are:
| Code | Name | Code | Name |
|---|---|---|---|
| 1 | Cathedral family - Rock outcrop complex, extremely stony. | 21 | Typic Cryaquolls - Leighcan family, till substratum complex. |
| 2 | Vanet - Ratake families complex, very stony. | 22 | Leighcan family, till substratum, extremely bouldery. |
| 3 | Haploborolis - Rock outcrop complex, rubbly. | 23 | Leighcan family, till substratum - Typic Cryaquolls complex. |
| 4 | Ratake family - Rock outcrop complex, rubbly. | 24 | Leighcan family, extremely stony. |
| 5 | Vanet family - Rock outcrop complex complex, rubbly. | 25 | Leighcan family, warm, extremely stony. |
| 6 | Vanet - Wetmore families - Rock outcrop complex, stony. | 26 | Granile - Catamount families complex, very stony. |
| 7 | Gothic family. | 27 | Leighcan family, warm - Rock outcrop complex, extremely stony. |
| 8 | Supervisor - Limber families complex. | 28 | Leighcan family - Rock outcrop complex, extremely stony. |
| 9 | Troutville family, very stony. | 29 | Como - Legault families complex, extremely stony. |
| 10 | Bullwark - Catamount families - Rock outcrop complex, rubbly. | 30 | Como family - Rock land - Legault family complex, extremely stony. |
| 11 | Bullwark - Catamount families - Rock land complex, rubbly. | 31 | Leighcan - Catamount families complex, extremely stony. |
| 12 | Legault family - Rock land complex, stony. | 32 | Catamount family - Rock outcrop - Leighcan family complex, extremely stony. |
| 13 | Catamount family - Rock land - Bullwark family complex, rubbly. | 33 | Leighcan - Catamount families - Rock outcrop complex, extremely stony. |
| 14 | Pachic Argiborolis - Aquolis complex. | 34 | Cryorthents - Rock land complex, extremely stony. |
| 15 | unspecified in the USFS Soil and ELU Survey. | 35 | Cryumbrepts - Rock outcrop - Cryaquepts complex. |
| 16 | Cryaquolis - Cryoborolis complex. | 36 | Bross family - Rock land - Cryumbrepts complex, extremely stony. |
| 17 | Gateview family - Cryaquolis complex. | 37 | Rock outcrop - Cryumbrepts - Cryorthents complex, extremely stony. |
| 18 | Rogert family, very stony. | 38 | Leighcan - Moran families - Cryaquolls complex, extremely stony. |
| 19 | Typic Cryaquolis - Borohemists complex. | 39 | Moran family - Cryorthents - Leighcan family complex, extremely stony. |
| 20 | Typic Cryaquepts - Typic Cryaquolls complex. | 40 | Moran family - Cryorthents - Rock land complex, extremely stony. |
The forest cover types are:
| Code | Name |
|---|---|
| 1 | Spruce/Fir |
| 2 | Lodgepole Pine |
| 3 | Ponderosa Pine |
| 4 | Cottonwood/Willow |
| 5 | Aspen |
| 6 | Douglas-fir |
| 7 | Krummholz |
Trees bring incredible value to life on Earth. They have several important functions, including:
We need to help trees help us. Accurately predicting forest cover types is important for more efficient and effective tree management so that trees can continue to do a good job of nourishing the planet and supporting our lives. Accurate prediction can help with tasks including:
Goal: Our goal is to build a well-performing predictive model, learn about properties of different classifiers, and learn about the model building process along the way.
We aim to explore different classifiers and make meaningful comparisons between model performance based on our knowledge of the dataset.
Our model evaluation criteria:
import warnings
warnings.filterwarnings("ignore")
# Data Structures and Feature Processing
import numpy as np
import pandas as pd
from sklearn.preprocessing import scale
from sklearn.preprocessing import normalize
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
# Model Selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
# Metrics
from sklearn import metrics
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
# Plotting
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from plotly import __version__
import plotly.express as px
from IPython.display import Javascript
Javascript(
"""require.config({
paths: {
plotly: 'https://cdn.plot.ly/plotly-latest.min'
}
});"""
)
from IPython.display import Javascript
from plotly.offline import get_plotlyjs
Javascript(get_plotlyjs())
# Algorithms
from sklearn import linear_model
from sklearn.linear_model import Perceptron
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
forest_data_labeled_raw = pd.read_csv('data/train.csv', sep=",")
forest_data_labeled_raw.head(10)
We'll only be doing EDA on the training dataset.
forest_data_no_id_cover_type = forest_data_labeled_raw.iloc[:,1:-1] # remove Id and Cover_Type columns
# Split labeled data into train, development, and test sets
X, y = forest_data_no_id_cover_type, forest_data_labeled_raw['Cover_Type']
X_all_train, X_test, y_all_train, y_test = train_test_split(X, y, test_size=0.2,
random_state=12345, shuffle=True)
X_train, X_dev, y_train, y_dev = train_test_split(X_all_train, y_all_train, test_size=0.2,
random_state=12345, shuffle=True)
missing_val_count = pd.DataFrame(forest_data_labeled_raw.isnull().sum())
missing_val_count.columns = ['num_missing']
print('Number of rows w/ missing values: ', sum(missing_val_count['num_missing']))
print('Unique data types for all features:')
print(np.unique(forest_data_labeled_raw.dtypes.values))
print('\nTherefore, all features are 64 bit integers.')
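These two checks (no missing values, all columns 64-bit integers) can be folded into a quick assertion helper; a minimal sketch (the function name is ours, not from the notebook):

```python
import pandas as pd

def validate_raw_frame(df: pd.DataFrame) -> bool:
    """Integrity checks used above: no missing values, all columns are 64-bit integers."""
    assert df.isnull().sum().sum() == 0, 'unexpected missing values'
    assert all(str(t) == 'int64' for t in df.dtypes), 'unexpected non-int64 column'
    return True

# e.g. validate_raw_frame(forest_data_labeled_raw)
```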
plt.figure(figsize=(8, 6))
ax = sns.countplot(x='Wilderness_Areas', data=df_engineered)
ax.set_title('Counts for each Wilderness Area in Train Data', fontsize=16)
ax.set_xlabel('Wilderness Area Name', fontsize=14, labelpad=12)
ax.set_ylabel('Count', fontsize=14, labelpad=12)
ax.tick_params(labelsize=12)
plt.show()
plt.figure(figsize=(9,9))
plt.title('Heatmap of Correlations between all Numeric Features in Training Set', fontsize=14, y=1.04)
plt.tick_params(labelsize=12)
numeric_data = X_all_train.drop(X_all_train.columns[10:54], axis=1)  # only measure correlation for the numeric features; exclude the binary Wilderness_Area and Soil_Type columns
sns.heatmap(numeric_data.corr(),cbar=True,annot=True,cmap='Greens')
plt.show()
cover_types = list(df_engineered['Cover_Type_Name'].unique())
fig, axes = plt.subplots(2, 2, figsize=(18, 16))
# Map each feature to its subplot instead of branching per feature
feature_axes = {
    'Elevation': axes[0, 0],
    'Horizontal_Distance_To_Hydrology': axes[0, 1],
    'Horizontal_Distance_To_Roadways': axes[1, 0],
    'Horizontal_Distance_To_Fire_Points': axes[1, 1],
}
for c in cover_types:
    subset = df_engineered[df_engineered['Cover_Type_Name'] == c]
    for t, ax in feature_axes.items():
        sns.distplot(subset[t], hist=False, kde=True, kde_kws={'linewidth': 3}, label=c, ax=ax)
        ax.set_title('Density Plot of Forest Cover by ' + t.replace('_', ' '), fontsize=14)
        ax.tick_params(labelsize=12)
        ax.legend()
We used Logistic Regression with the parameters used in class.
# Scale features
scaler_baseline = MinMaxScaler()
X_train_scaled_baseline = pd.DataFrame(scaler_baseline.fit_transform(X_train), columns=X_train.columns)
X_dev_scaled_baseline = pd.DataFrame(scaler_baseline.transform(X_dev), columns=X_dev.columns)
# Train model
lr_baseline = LogisticRegression(C=5, solver='liblinear', multi_class='auto')
lr_baseline.fit(X_train_scaled_baseline, y_train)
# Evaluate model on the scaled dev set (the model was trained on scaled features)
y_dev_preds_baseline = lr_baseline.predict(X_dev_scaled_baseline)
held = metrics.classification_report(y_dev, y_dev_preds_baseline, output_dict=True)
baseline_f1 = round(held['weighted avg']['f1-score'] * 100, 2)
baseline_acc = round(lr_baseline.score(X_dev_scaled_baseline, y_dev) * 100, 2)
print("Baseline Model")
print("--------------")
print("Weighted F1-Score: " + str(baseline_f1) + "%")
print("Accuracy: " + str(baseline_acc) + "%")
# Perceptron
perceptron = Perceptron(max_iter=5, random_state=12345)
perceptron.fit(X_train, y_train)
f1_perceptron = round(f1_score(y_dev, perceptron.predict(X_dev), average='weighted') * 100, 2)
print(f1_perceptron, '%')
# KNN
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, y_train)
f1_knn = round(f1_score(y_dev, knn.predict(X_dev), average='weighted') * 100, 2)
print(f1_knn, '%')
# Gaussian Naive Bayes
gaussian = GaussianNB()
gaussian.fit(X_train, y_train)
f1_gaussian = round(f1_score(y_dev, gaussian.predict(X_dev), average='weighted') * 100, 2)
print(f1_gaussian, '%')
# Decision Tree
decision_tree = DecisionTreeClassifier(random_state=12345)
decision_tree.fit(X_train, y_train)
f1_decision_tree = round(f1_score(y_dev, decision_tree.predict(X_dev), average='weighted') * 100, 2)
print(f1_decision_tree, '%')
# Random Forest
random_forest = RandomForestClassifier(n_estimators=100, random_state=12345)
random_forest.fit(X_train, y_train)
f1_random_forest = round(f1_score(y_dev, random_forest.predict(X_dev), average='weighted') * 100, 2)
print(f1_random_forest, '%')
# Logistic Regression
logreg = LogisticRegression(random_state=12345)
logreg.fit(X_train, y_train)
f1_logreg = round(f1_score(y_dev, logreg.predict(X_dev), average='weighted') * 100, 2)
print(f1_logreg, '%')
# Stochastic Gradient Descent (SGD) learning
sgd = linear_model.SGDClassifier(max_iter=5, tol=None, random_state=12345)
sgd.fit(X_train, y_train)
f1_sgd = round(f1_score(y_dev, sgd.predict(X_dev), average='weighted') * 100, 2)
print(f1_sgd, '%')
# Linear SVC (Support Vector Machines)
linear_svc = LinearSVC(random_state=12345)
linear_svc.fit(X_train, y_train)
f1_linear_svc = round(f1_score(y_dev, linear_svc.predict(X_dev), average='weighted') * 100, 2)
print(f1_linear_svc, '%')
Which feature sets yielded the best performance?
We'll move forward with the second model since it has the best performance and is faster than the first model.
# numeric_features must be defined before it is used in the grid search below
numeric_features = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points']
k_range = list(range(1, 31))
param_grid = dict(n_neighbors=k_range)
grid = GridSearchCV(estimator=KNeighborsClassifier(),
                    param_grid=param_grid, verbose=1, cv=10, scoring='f1_weighted',
                    n_jobs=-1, return_train_score=True)
grid_search = grid.fit(X_train[numeric_features], y_train)
knn_gridcv_f1 = grid_search.best_score_ *100
print("F1-score for our training dataset with GridSearchCV tuning is:\n{:.2f}%".format(knn_gridcv_f1) )
print('Best estimator for GridSearchCV:')
print(grid_search.best_estimator_)
Our concern with overfitting still stands, so we won't blindly adopt the grid's best k. Based on our findings and reasoning above, our final KNN model will use only the numeric features and k=3.
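The "don't just take k=1" reasoning can be made explicit when reading off grid results; a minimal sketch (the helper name and inputs are ours, not from the original notebook):

```python
def best_k_excluding_one(k_values, mean_test_scores):
    """Pick the k with the highest CV score, skipping k=1 (which can memorize the training data)."""
    pairs = [(k, s) for k, s in zip(k_values, mean_test_scores) if k != 1]
    return max(pairs, key=lambda p: p[1])[0]

# e.g. best_k_excluding_one(k_range, grid_search.cv_results_['mean_test_score'])
```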
# Final KNN Model - numeric features only, k=3
numeric_features = ['Elevation', 'Aspect', 'Slope', 'Horizontal_Distance_To_Hydrology', 'Vertical_Distance_To_Hydrology', 'Horizontal_Distance_To_Roadways', 'Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm', 'Horizontal_Distance_To_Fire_Points']
knn_final_model = KNeighborsClassifier(n_neighbors=3)
knn_final_model.fit(X_train[numeric_features], y_train)
y_dev_preds_knn = knn_final_model.predict(X_dev[numeric_features])
f1_knn_final = round(f1_score(y_dev, y_dev_preds_knn, average='weighted') * 100, 2)
print('Final KNN Model')
print('---------------')
print('Dev f1-score: ' + str(f1_knn_final) + '%')
basic_random_forest = RandomForestClassifier(random_state=12345) # num trees in forest n_estimators=100 by default
basic_random_forest.fit(X_train, y_train)
y_predict_rf = basic_random_forest.predict(X_dev)
f1_rf_basic = round(f1_score(y_dev, y_predict_rf, average='weighted') * 100, 2)
print('Basic Random Forest Dev f1-score: ' + str(f1_rf_basic) + '%')
rf_cv_scores = cross_val_score(basic_random_forest, X_train, y_train, cv=10, scoring='f1_weighted')
print('Running K-Fold Cross Validation on Basic Random Forest Model')
print('------------------------------------------------------------')
print('F1-Scores: ', rf_cv_scores)
print('Mean: ', rf_cv_scores.mean())
print('Standard Deviation: ', rf_cv_scores.std())
param_grid = { "criterion" : ["gini", "entropy"], "max_features" : [10, 17, 30], "n_estimators": [100, 500, 600]}
rf_gridcv = RandomForestClassifier(random_state=12345)
rf_clf = GridSearchCV(estimator=rf_gridcv, param_grid=param_grid, cv=10, scoring='f1_weighted')
rf_clf.fit(X_all_train, y_all_train)
print('Best estimators for GridSearchCV with Random Forest: ', rf_clf.best_estimator_)
random_forest_model_gridcv = RandomForestClassifier(n_estimators=500, max_features=17, random_state=12345)
grid_cv_scores_reduced_f1 = cross_val_score(random_forest_model_gridcv, X_all_train, y_all_train, cv=10, scoring='f1_weighted')
print('Random Forest using GridSearchCV parameters and 10-fold Cross Validation F1-scoring:')
print('------------------------------------------------------------------------------------')
print('Scores: ', grid_cv_scores_reduced_f1)
print('Mean: ', grid_cv_scores_reduced_f1.mean())
print('Standard Deviation: ', grid_cv_scores_reduced_f1.std())
Random Forest with hyperparameters from GridSearchCV increases the cross-validation score by 2%, a meaningful improvement.
random_forest_final_model = RandomForestClassifier(n_estimators=500, max_features=17, random_state=12345)
random_forest_final_model.fit(X_train, y_train)
y_dev_preds_rf = random_forest_final_model.predict(X_dev)
f1_rf_final = round(f1_score(y_dev, y_dev_preds_rf, average='weighted') * 100, 2)
print('Final Random Forest Model')
print('-------------------------')
print('Dev F1-score: ' + str(f1_rf_final) + '%')
After running our baseline model, we chose some parameters intuitively:
- class_weight='balanced': we felt it would help with the data imbalances.
- max_iter=500: we tried several values for max_iter, and the smallest value that would converge was 500.
- solver='newton-cg': recommended for multiclass problems and does not immediately need preprocessing.
- random_state=1234: we tried to keep the random state (1234) consistent throughout all of the models.
- multi_class='auto': the model should choose multinomial on its own without direction.
lr_manual1 = LogisticRegression(class_weight='balanced', max_iter=500, solver='newton-cg', random_state=1234, multi_class="auto") # run the model
lr_manual1.fit(X_train, y_train)
lr_manual1_y_preds = lr_manual1.predict(X_dev)
lr_manual1_f1 = f1_score(y_dev, lr_manual1_y_preds, average='weighted')
lr_manual1_f1 = round(lr_manual1_f1 * 100, 2)
print("Logistic Regression with Manual Hyperparameter Tuning")
print("-----------------------------------------------------")
print("Dev f1-score: " + str(lr_manual1_f1) + "%")
Hillshade is an incidental measure of light, based on elevation and geographic shape: see article
Soil types, if they have a relationship, are related to Elevation, Wilderness_Area, Hydrology, and Roadways. Based on some additional research, hillshade is a function of landmass and topographic shape, and is connected to Aspect. We think we can drop soil types as long as wilderness is kept.
Looking at the correlation map of the categorical soil types as a function of wilderness type, there is the possibility that the strengths are erroneous, but there is also the possibility that a relationship exists. Let's look at the counts of soil type by wilderness area to see.
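That soil-by-wilderness cross-tabulation can be computed directly from the one-hot columns; a sketch of our own (column naming assumed to follow the Kaggle train.csv convention `Wilderness_Area1..4`, `Soil_Type1..40`):

```python
import pandas as pd

def soil_counts_by_wilderness(df: pd.DataFrame, n_wild: int = 4, n_soil: int = 40) -> pd.DataFrame:
    """Count rows carrying each soil type inside each wilderness area (one-hot columns)."""
    wild_cols = [f'Wilderness_Area{i}' for i in range(1, n_wild + 1)]
    soil_cols = [f'Soil_Type{i}' for i in range(1, n_soil + 1)]
    # cross-tabulate by multiplying the two one-hot blocks
    return pd.DataFrame(df[wild_cols].T.values @ df[soil_cols].values,
                        index=wild_cols, columns=soil_cols)

# e.g. soil_counts_by_wilderness(X_train)
```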
The signal in the counts of soil type by wilderness area is mixed. Strong signals, such as soil type 29 in wilderness area 1, don't line up with what we expected based on the strengths in our multiple iterations of correlation maps. Given that, let's try dropping all of the soil types, since their data is sparse and they add dimensionality.
We will also attempt to drop hillshade based on some external research that suggests that hillshade is a calculated variable based on sun position (aspect) and physical topographic features.
For data preprocessing, based on some additional EDA (and the fact that logistic regression accepts the int64 features directly), we will drop all of the soil type columns because of their relationship to wilderness type, and the Hillshade_Noon, Hillshade_9am, and Hillshade_3pm variables because of their relationship to Aspect.
Let's see if the model will improve if we drop the following feature sets:
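The frames `no_hill`, `no_soil`, `no_soil_hill` (and their dev counterparts) used below aren't constructed in this excerpt; one way they might be built, sketched with a helper of our own naming:

```python
import pandas as pd

def drop_feature_sets(df: pd.DataFrame, drop_hill: bool = False, drop_soil: bool = False) -> pd.DataFrame:
    """Return a copy of df without the hillshade and/or one-hot soil type columns."""
    cols = []
    if drop_hill:
        cols += ['Hillshade_9am', 'Hillshade_Noon', 'Hillshade_3pm']
    if drop_soil:
        cols += [c for c in df.columns if c.startswith('Soil_Type')]
    return df.drop(columns=[c for c in cols if c in df.columns])

# e.g. no_hill = drop_feature_sets(X_train, drop_hill=True)
#      dev_no_soil_hill = drop_feature_sets(X_dev, drop_hill=True, drop_soil=True)
```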
# Model 1: Dropped selected hill features
print('Model 1: Dropped selected hill features')
print('---------------------------------------')
lr_cat_drop1 = LogisticRegression(class_weight='balanced', max_iter=500, solver='newton-cg', random_state=1234, multi_class="auto")
lr_cat_drop1.fit(no_hill, y_train)
lr_cat_drop1_preds = lr_cat_drop1.predict(dev_no_hill)
print(classification_report(y_dev, lr_cat_drop1_preds))
# Model 2: Dropped selected hill and selected soil features
print('Model 2: Dropped selected hill and selected soil features')
print('---------------------------------------------------------')
lr_cat_drop2 = LogisticRegression(class_weight='balanced', max_iter=500, solver='newton-cg', random_state=1234, multi_class="auto")
lr_cat_drop2.fit(no_soil_hill, y_train)
lr_cat_drop2_preds = lr_cat_drop2.predict(dev_no_soil_hill)
print(classification_report(y_dev, lr_cat_drop2_preds))
# Model 3: Dropped selected soil features
print('Model 3: Dropped selected soil features')
print('---------------------------------------')
lr_cat_drop3 = LogisticRegression(class_weight='balanced', max_iter=500, solver='newton-cg', random_state=1234, multi_class="auto")
lr_cat_drop3.fit(no_soil, y_train)
lr_cat_drop3_preds = lr_cat_drop3.predict(dev_no_soil)
print(classification_report(y_dev, lr_cat_drop3_preds))
Removing the Hillshade and soil variables, whether in full or in part, hurt the model's predictive power. Therefore we'll keep all features for our Logistic Regression model.
#citation: https://dwbi1.wordpress.com/2021/04/16/logistic-regression-with-pca-in-python/
pca = PCA(random_state=1234)
pca.fit(X_train)
explained_variance = np.cumsum(pca.explained_variance_ratio_)
y_max_val = 1.05
fig, ax = plt.subplots(1, 1, figsize=(6, 4))
ax.plot(explained_variance)
for x in [2, 5, 7, 10, 15, 20]:  # candidate component counts to eyeball
    ax.vlines(x=x, ymax=y_max_val, ymin=0, colors='r', linestyles='--', alpha=0.6)
ax.hlines(y=0.95, xmax=54, xmin=0, colors='g', linestyles='--', alpha=0.6)
ax.set_title('Principal Components vs.\nExplained Variance', fontsize=12)
ax.set_xlabel('Number of Principal Components', fontsize=9, labelpad=10)
ax.set_ylabel('Cumulative\nExplained Variance', fontsize=9, labelpad=10)
ax.set_ylim(0.7, y_max_val)
plt.tight_layout()
plt.show()
We tried some initial PCA exploration, but after viewing our visualization of principal components against cumulative explained variance, it was clear something was off: the best number of principal components was 1. On re-review of PCA, because most of our columns are binary/categorical, they do not map well to "sequences of direction vectors that differ on basis of best-fit lines" (source), since at the most elemental level our data likely follows a discrete probability distribution. We stopped here and moved on to hyperparameter tuning.
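If one did want to pursue PCA here, restricting it to the ten continuous columns (standardized first, since PCA is scale-sensitive) would be more defensible than running it over the binary columns; a sketch of our own, not from the notebook:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def numeric_pca_variance(X_numeric, n_components=None):
    """Cumulative explained variance of PCA fit on standardized numeric features."""
    pipe = make_pipeline(StandardScaler(), PCA(n_components=n_components, random_state=1234))
    pipe.fit(X_numeric)
    return np.cumsum(pipe.named_steps['pca'].explained_variance_ratio_)

# e.g. numeric_pca_variance(X_train[numeric_features])
```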
Running GridSearchCV yielded the following optimal parameters:
- C=0.7: regularization strength
- class_weight='balanced'
- max_iter=200: higher values of max_iter increase the possibility of overfitting
- penalty='l1': adds an l1 penalty term, which limits the size of the coefficients, reducing the possibility of overfitting
- solver='saga': works for multiclass problems and converges faster when features are on approximately the same scale, which is why we do the MinMaxScaling beforehand
- random_state=1234

To build the GridSearchCV parameter grid we referenced the scikit-learn documentation, a few articles about logistic regression tuning, bits we had experimented with in class, and the warnings that propagated during testing. Commented out below is the full test grid; the grid run in this report is edited down for speed. See sources here, here, here, and here.
# Original parameter grid inspired by research from scikit-learn documentation
# param_grid_lr = {
#     'max_iter': [20, 50, 100, 200, 500, 1000],
#     'solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
#     'class_weight': ['balanced'],
#     'penalty': ['l1', 'l2', 'none'],
#     'C': [.0001, .1, .2, .3, .5, .7]  # given by project 3
# }
param_grid_lr = {
    'max_iter': [20, 200, 500],
    'solver': ['lbfgs', 'saga'],
    'class_weight': ['balanced'],
    'penalty': ['l1', 'l2', 'none'],  # note: the string 'None' is not a valid penalty; 'none' disables regularization
    'C': [0.3, 0.5, 0.7]
}
scaler = MinMaxScaler(feature_range = (0,1))
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_dev_scaled = scaler.transform(X_dev)
logModel_grid = GridSearchCV(estimator=LogisticRegression(random_state=1234),
param_grid=param_grid_lr, verbose=1, cv=10,
n_jobs=-1, return_train_score=True)
logModel_grid.fit(X_train_scaled, y_train)
We also tried LogisticRegressionCV, but it was slower, did not produce better predictive power, and its documentation is not as robust as GridSearchCV's, so it has been excluded.
scaler = MinMaxScaler(feature_range = (0,1))
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_dev_scaled = scaler.transform(X_dev)
log_reg_final_model = LogisticRegression(C=0.7, class_weight='balanced', max_iter=200, penalty='l1',
random_state=1234, solver='saga')
cv_scaler = MinMaxScaler(feature_range=(0, 1))
X_all_train_scaled = cv_scaler.fit_transform(X_all_train)  # scale before CV: saga converges best on scaled features
lr_final_cv_f1 = cross_val_score(log_reg_final_model, X_all_train_scaled, y_all_train, cv=10, scoring='f1_weighted')
print('CV Scores: ', lr_final_cv_f1)
print('CV Mean: ', lr_final_cv_f1.mean())
print('CV Standard Deviation: ', lr_final_cv_f1.std())
log_reg_final_model.fit(X_train_scaled, y_train)
y_dev_predictions_lr_final = log_reg_final_model.predict(X_dev_scaled)
f1_lr_final = round(f1_score(y_dev, y_dev_predictions_lr_final, average='weighted') * 100, 2)
print('Dev F1-score: ' + str(f1_lr_final) + '%')
On this particular data set, logistic regression performs just "average". Intuition and experimentation lead us to believe this is because the regression struggles to interpret the signals in the soil type and hillshade variables. Colloquially, the model "can tell" they are important variables but does not know exactly how they should inform the final decision.
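One way to probe that intuition is to compare average coefficient magnitudes across the feature groups; a sketch of our own (helper name and grouping-by-prefix are assumptions, not the notebook's code):

```python
import numpy as np

def mean_abs_coef_by_group(coef, feature_names,
                           prefixes=('Soil_Type', 'Hillshade', 'Wilderness_Area')):
    """Mean absolute multiclass coefficient per feature-name prefix group."""
    mags = np.abs(coef).mean(axis=0)  # average over the per-class coefficient rows
    out = {}
    for p in prefixes:
        idx = [i for i, n in enumerate(feature_names) if n.startswith(p)]
        if idx:
            out[p] = float(mags[idx].mean())
    return out

# e.g. mean_abs_coef_by_group(log_reg_final_model.coef_, list(X_train.columns))
```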
# KNN
y_test_preds_knn = knn_final_model.predict(X_test[numeric_features])
f1_knn_final_test = round(f1_score(y_test, y_test_preds_knn, average='weighted') * 100, 2)
# Random Forest
y_test_preds_rf = random_forest_final_model.predict(X_test)
f1_rf_final_test = round(f1_score(y_test, y_test_preds_rf, average='weighted') * 100, 2)
# Logistic Regression
X_test_scaled = scaler.transform(X_test)
y_test_preds_lr = log_reg_final_model.predict(X_test_scaled)
f1_lr_final_test = round(f1_score(y_test, y_test_preds_lr, average='weighted') * 100, 2)
# The three models share the same misclassification analysis, so we factor it into a helper
def misclassified(X_dev, y_dev, preds, actual=2, predicted=1):
    """Rows where the model predicted `predicted` but the true cover type was `actual`."""
    df = X_dev.copy()
    df['Predicted_Cover'] = preds
    df['Actual_Cover'] = y_dev
    wrong = df[df['Actual_Cover'] != df['Predicted_Cover']]
    return wrong[(wrong['Actual_Cover'] == actual) & (wrong['Predicted_Cover'] == predicted)]
# KNN Incorrects
incorrect_knn_df = misclassified(X_dev, y_dev, y_dev_preds_knn)
# Random Forest Incorrects
incorrect_rf_df = misclassified(X_dev, y_dev, y_dev_preds_rf)
# Logistic Regression Incorrects (the original cell reused the Random Forest predictions here by mistake)
incorrect_lr_df = misclassified(X_dev, y_dev, y_dev_predictions_lr_final)
The plots of incorrect predictions by wilderness area count for KNN and Random Forest above show that errors could be influenced by imbalances in the counts for each wilderness area type.
We choose the Random Forest model based on the criteria we defined earlier:
To improve upon our work, it would be good to consider the following in the future:
Jing
Stephanie Cabanela
Catherine Jimerson
Geet Samra